Compilation of notes for coding with Python
Linking with Google Sheets to have a searchable dictionary for coding with Python
Unfortunately, new line within a cell does not show up in DT below, so the formatting isn’t great.
Learning how to use Python in R, using the reticulate package
Make a copy of the original dataset before making any changes.
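Get a feel for the data. Use glimpse() in R, or my_df.head().T in Python. Other commands include: df.tail(), df.index, df.columns, df.describe(), df.info(), df.shape, df.sample(5, random_state = 0). Note that df.index, df.columns and df.shape are attributes (no parentheses), while df.info() is a method.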
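The first two steps (copy, then inspect) can be sketched as below. The column names and values are made up for illustration; any DataFrame would do.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
raw_df = pd.DataFrame({
    "Age": [22, 35, 58, 41, 29, 63],
    "Sex": ["F", "M", "M", "F", "F", "M"],
    "Fare": [7.25, 71.28, 26.55, 8.05, 13.00, 30.50],
})

# Work on a copy so the original stays untouched.
my_df = raw_df.copy()

first_rows = my_df.head().T        # first 5 rows, transposed for easier reading
last_rows = my_df.tail(2)          # last 2 rows
row_labels = my_df.index           # attribute, not a method
column_names = my_df.columns       # attribute, not a method
n_rows, n_cols = my_df.shape       # (rows, columns) tuple
my_df.info()                       # method: prints dtypes and non-null counts
random_rows = my_df.sample(5, random_state=0)  # reproducible random sample

print(first_rows)
print(n_rows, n_cols)
```

Transposing the head is mainly useful for wide tables, where the columns would otherwise be cut off in the console.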
Check the data type of each column to see if any need to be converted to a more suitable dtype.
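A minimal sketch of checking and converting dtypes, assuming a small made-up table where numbers and dates were read in as text. Which conversions are "right" depends on the dataset.

```python
import pandas as pd

# Hypothetical data: everything arrives as strings.
df = pd.DataFrame({
    "Age": ["22", "35", "58"],
    "Sex": ["F", "M", "M"],
    "Visit": ["2022-01-03", "2022-01-10", "2022-02-14"],
})
print(df.dtypes)  # all columns are text at this point

df["Age"] = pd.to_numeric(df["Age"])        # text -> integer
df["Sex"] = df["Sex"].astype("category")    # text -> categorical
df["Visit"] = pd.to_datetime(df["Visit"])   # text -> datetime
print(df.dtypes)
```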
Check which columns are redundant, and drop them if they are not needed: df_dropped_age_sex = df.drop(['Age', 'Sex'], axis = 1). Preview the result with df_dropped_age_sex.head(1).T
Check for duplicated rows (only those that match perfectly across all columns): data_iris_preprocessed = data_iris.drop_duplicates()
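Dropping duplicated rows and redundant columns can be sketched together as below. The data are invented, and treating Age and Sex as the redundant columns simply mirrors the example above.

```python
import pandas as pd

# Hypothetical data with one exact duplicate (rows 1 and 2).
df = pd.DataFrame({
    "Age": [22, 35, 35, 58],
    "Sex": ["F", "M", "M", "M"],
    "Fare": [7.25, 71.28, 71.28, 26.55],
})

# Count rows that are identical to an earlier row across all columns.
n_duplicates = df.duplicated().sum()

# Both calls return new DataFrames; df itself is not modified.
df_unique = df.drop_duplicates()
df_dropped_age_sex = df_unique.drop(columns=["Age", "Sex"])

print(n_duplicates)
print(df_dropped_age_sex)
```

Counting duplicates before dropping them makes it easier to notice when an unexpectedly large share of the data disappears.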
Check descriptive (summary) statistics to see if scaling is needed and whether there may be outliers.
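As a rough illustration, describe() puts the ranges of the columns side by side. The two columns here are made up to show variables on very different scales.

```python
import pandas as pd

# Hypothetical columns on very different scales.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "income": [20000, 30000, 40000, 50000, 60000],
})

summary = df.describe()  # count, mean, std, min, quartiles, max per column
print(summary.loc[["mean", "std", "min", "max"]])
```

When the means and standard deviations differ by orders of magnitude, as they do here, scaling is usually worth considering for distance-based methods.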
Check for unique values to see if any factor-recoding should be done.
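A small sketch of spotting inconsistent labels and recoding them. The labels and the mapping are hypothetical.

```python
import pandas as pd

# Hypothetical column with inconsistent labels for the same categories.
df = pd.DataFrame({"Sex": ["F", "M", "m", "female", "M", "F"]})

print(df["Sex"].unique())    # distinct labels as entered
print(df["Sex"].nunique())   # number of distinct labels

# Recode the variants to one label per category (mapping is an assumption).
sex_map = {"F": "F", "female": "F", "M": "M", "m": "M"}
df["Sex_recoded"] = df["Sex"].map(sex_map)
print(df["Sex_recoded"].unique())
```

Values missing from the mapping become NaN with map(), which helps catch labels that were overlooked.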
Use a boxplot or histogram to see if there are any outliers
Check class distribution
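One way to check the class distribution is value_counts(). The species column below is a made-up stand-in for whatever the target variable is.

```python
import pandas as pd

# Hypothetical target column with imbalanced classes.
df = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 3 + ["virginica"] * 2
})

class_counts = df["species"].value_counts()                # absolute counts
class_share = df["species"].value_counts(normalize=True)   # proportions

print(class_counts)
print(class_share)
```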
Check correlation
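A minimal sketch of a correlation check on invented numeric columns. corr() computes pairwise Pearson correlation by default.

```python
import pandas as pd

# Hypothetical numeric data: y is exactly 2 * x, z tends to fall as x rises.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})

# All columns are numeric here; select numeric columns first if they are not.
corr_matrix = df.corr()
print(corr_matrix.round(2))
```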
Check skews of univariate plots. Histograms may be used to show distribution for numerical data.
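Skewness can also be checked numerically alongside the plots. The two columns below are made up: one symmetric and one with a long right tail.

```python
import pandas as pd

# Hypothetical data: one symmetric column, one right-skewed column.
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6],
    "right_skewed": [1, 1, 2, 2, 3, 20],
})

skewness = df.skew()  # about 0 for symmetric data, positive for a right tail
print(skewness)
```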
Check for outliers, and ascertain whether each one is due to measurement error, data corruption, or is a true outlier (this requires domain knowledge). It is not a good idea to remove outliers without knowing why they occur. One way to flag potential outliers is the IQR method: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are treated as outliers.
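The IQR rule can be sketched as follows. The values are invented, and 1.5 is the conventional multiplier rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value (40).
values = pd.Series([10, 12, 11, 13, 12, 11, 14, 13, 12, 40])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag, rather than delete, anything outside the fences.
is_outlier = (values < lower_fence) | (values > upper_fence)
print(values[is_outlier])
```

Flagging first keeps the decision to remove a point separate from the detection step, so each flagged value can be checked before anything is dropped.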
For categorical data, counts (barplot) may be used.
For bivariate relationships (numerical/numerical): use a scatterplot to show the relationship, or a jointplot to show both the scatterplot and the histograms.
For bivariate relationships (numerical Y, categorical X): use boxplots
For multivariate relationships, use pairplot
library(reticulate)
## check configuration
# py_config()
# py_discover_config()
# install packages - https://rstudio.github.io/reticulate/articles/python_packages.html
# py_install('pandas')
# py_install('scipy')
# py_install('matplotlib')
# py_install('seaborn')
# py_install('scikit-learn')
Import libraries. Mark the code chunk as Python instead of R (in R Markdown, use a {python} chunk header).
import numpy as np
my_python_array = np.array([2,4,6,8])
my_python_array
array([2, 4, 6, 8])
my_r_array <- py$my_python_array
my_python_array_2 = r.my_r_array
For attribution, please cite this work as
lruolin (2022, March 10). pRactice corner: Summary for Python Codes. Retrieved from https://lruolin.github.io/myBlog/posts/20220214 - Compiled Python Notes/
BibTeX citation
@misc{lruolin2022summary,
  author = {lruolin, },
  title = {pRactice corner: Summary for Python Codes},
  url = {https://lruolin.github.io/myBlog/posts/20220214 - Compiled Python Notes/},
  year = {2022}
}